I will plot all the variables to see what their distributions are like. Density plots are better suited than histograms to continuous variables such as these.
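As a minimal sketch of what a density plot involves, the block below estimates and draws a kernel density for one variable. It uses base R's `density()` and `plot()` rather than whichever plotting code produced the figures in this report, and it uses only the first ten alcohol values printed by `str()` in this report, so the shape is illustrative only.

```r
# First ten alcohol values as printed by str() in this report;
# the real plots would use all 4898 observations.
alcohol <- c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11)

# Kernel density estimate with the default Gaussian kernel and bandwidth
alcohol_density <- density(alcohol)

# Draw the estimated curve and mark the individual observations
plot(alcohol_density, main = "Alcohol", xlab = "alcohol")
rug(alcohol)
```

Unlike a histogram, the curve does not depend on where bin edges fall, although it does depend on the bandwidth that `density()` chooses.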
This is the structure of the data,
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : Factor w/ 7 levels "3","4","5","6",..: 4 4 4 4 4 4 4 4 4 4 ...
and these are summaries of each variable in the data set.
## X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600 Min. :0.00900
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700 1st Qu.: 1.700 1st Qu.:0.03600
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200 Median : 5.200 Median :0.04300
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342 Mean : 6.391 Mean :0.04577
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900 3rd Qu.: 9.900 3rd Qu.:0.05000
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600 Max. :65.800 Max. :0.34600
##
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol quality
## Min. : 2.00 Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200 Min. : 8.00 3: 20
## 1st Qu.: 23.00 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100 1st Qu.: 9.50 4: 163
## Median : 34.00 Median :134.0 Median :0.9937 Median :3.180 Median :0.4700 Median :10.40 5:1457
## Mean : 35.31 Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898 Mean :10.51 6:2198
## 3rd Qu.: 46.00 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500 3rd Qu.:11.40 7: 880
## Max. :289.00 Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800 Max. :14.20 8: 175
## 9: 5
The main features of interest in the white wine dataset are quality and alcohol. These are the two most likely reasons why people buy and drink wine in the first place.
Theory suggests that sulphates and residual sugar have strong influences on the quality and alcohol content of a wine. I hope to investigate this theory.
I changed the nature of quality from an integer to a factor. This is a table of the results.
##
## 3 4 5 6 7 8 9
## 20 163 1457 2198 880 175 5
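The conversion can be sketched as follows. Because the raw file is not reproduced here, the integer vector is rebuilt from the counts in the table above; the name `quality_int` is hypothetical, and row order is not preserved.

```r
# Rebuild an integer quality vector from the counts in the table above.
# Only the counts match the dataset; the original row order is lost.
quality_int <- rep(3:9, times = c(20, 163, 1457, 2198, 880, 175, 5))

# Convert the integer scores to a factor with one level per score
quality <- factor(quality_int)

str(quality)    # Factor w/ 7 levels
table(quality)  # counts per quality level
```

Treating quality as a factor lets grouping, coloring, and faceting handle each score as a separate category rather than as a point on a continuous scale.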
The distribution for wines of quality 9 looked unusual, but this is an artifact of there being only five such wines in the sample. pH is nearly normally distributed, with a median of 3.18 and a mean of 3.188. The remaining variables are heavily right-skewed, with means and medians that sit toward the low end of their ranges.
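One rough way to check the direction of skew from the summaries alone is to compare each mean with its median: in a right-skewed variable the long upper tail pulls the mean above the median. The sketch below does this for four variables, with values copied from the `summary()` output in this report. The data frame `stats` and the column `rel_gap` are hypothetical names, and the gap is only a heuristic, not a formal skewness statistic.

```r
# Mean and median values copied from the summary() output in this report
stats <- data.frame(
  variable = c("residual.sugar", "chlorides", "free.sulfur.dioxide", "pH"),
  mean     = c(6.391, 0.04577, 35.31, 3.188),
  median   = c(5.200, 0.04300, 34.00, 3.180)
)

# Relative gap between mean and median; a larger positive value
# hints at a longer right tail
stats$rel_gap <- (stats$mean - stats$median) / stats$median

# Sort from the largest gap to the smallest
stats[order(-stats$rel_gap), ]
```

Among these four, residual sugar shows the largest gap and pH the smallest, which is consistent with pH being the closest to symmetric.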
The first feature noticed was how the small sample size (five) for wines of Quality 9 distorted the results. As such, these five were removed to generate the graphs in this part of the report.
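A minimal sketch of that removal, assuming quality is stored as a factor. The vector is rebuilt from the quality counts reported earlier, and `quality_trimmed` is a hypothetical name.

```r
# Quality factor rebuilt from the counts reported earlier in this report
quality <- factor(rep(3:9, times = c(20, 163, 1457, 2198, 880, 175, 5)))

# Keep everything except the five wines of quality 9
keep <- quality != "9"

# droplevels() removes the now-empty level "9" so that it does not
# appear as an empty group in tables or plots
quality_trimmed <- droplevels(quality[keep])

table(quality_trimmed)
```

For a full data frame (here given the hypothetical name `wines`), the same idea would read `droplevels(wines[wines$quality != "9", ])`.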
Alcohol content varies considerably across quality levels, suggesting that the quality of a wine is dependent on its alcohol content.
The strongest relationship I found was that between density and alcohol, which has a Pearson correlation coefficient of -0.7801376.
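A coefficient of this kind can be computed with base R's `cor()`. The sketch below uses only the first five rows of density and alcohol as printed (and rounded) by `str()` in this report, so it will not reproduce the full-sample figure quoted above; the variable names are hypothetical.

```r
# First five rows as printed by str() in this report (rounded for display)
wine_density <- c(1.001, 0.994, 0.995, 0.996, 0.996)
wine_alcohol <- c(8.8, 9.5, 10.1, 9.9, 9.9)

# Pearson correlation: covariance scaled by both standard deviations
r <- cor(wine_density, wine_alcohol, method = "pearson")
r
```

On the full data set, `cor.test()` would additionally give a confidence interval and a p-value for the coefficient.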
Alcohol is a feature of quality wine. Alcohol content increases with quality, but certainly not in any mathematical sense.
Acidity or baseness makes no difference to alcohol content, irrespective of quality. It is hard to make a strong mathematical case for any strong relationships in the dataset, other than the inverse correlation between density and alcohol.
The relationship between residual sugar and alcohol is easier to see when residual sugar is plotted on the y-axis and alcohol on the x. This is counter-intuitive, as sugar is an independent variable and alcohol dependent in wine-making.
I did not create any model with the dataset. The only feature of the dataset that could be modeled is the inverse correlation between density and alcohol. That relationship is only of interest to chemists, and chemists are probably already aware of it. No other pair of variables is correlated strongly enough to support a model.
This graph plots density against alcohol for the sample data. The points on the scatter plot are set at alpha = 0.1 to reduce over-plotting. A trendline is added with the stat_smooth() function to show the strong negative correlation between density and alcohol.
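A plot of this kind could be sketched with ggplot2 as below. Several details are assumptions rather than the report's actual code: the data frame name `wines`, reading the 0.1 as the `alpha` transparency, and the choice of a linear fit (`method = "lm"`), since the text only names `stat_smooth()`. The data are the first five rows printed by `str()` in this report, so the points will be faint and sparse compared with the real figure.

```r
library(ggplot2)

# First five rows as printed by str() in this report (rounded for display);
# the real plot uses the full data set.
wines <- data.frame(
  density = c(1.001, 0.994, 0.995, 0.996, 0.996),
  alcohol = c(8.8, 9.5, 10.1, 9.9, 9.9)
)

p <- ggplot(wines, aes(x = alcohol, y = density)) +
  # Transparent points: overlapping observations build up darker regions
  geom_point(alpha = 0.1) +
  # Trend line; a linear fit without a confidence band is an assumption here
  stat_smooth(method = "lm", formula = y ~ x, se = FALSE)

print(p)
```

With thousands of points, a low alpha makes dense regions stand out, while isolated outliers stay faint.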
Residual sugar is plotted against alcohol in a scatterplot graph. The points are colored according to their quality to add further information to the graph, and are set at alpha = 0.5 to reduce over-plotting.
This is a graph of the alcohol quantity in wine broken down by the quality of the wines in the sample set. The faceting of the graph makes it easy to compare one wine’s alcohol quantity to another’s. Strong colors are used in the graph to make the data jump out.
The only strong correlation in this dataset is between density and alcohol, which is not a factor when people are buying wine. While this is a disappointment to statisticians, it is almost certainly good news for sommeliers, who can be reassured that their specialty is indeed more art than science.
This was not an easy dataset for someone who is neither a chemist nor an oenophile to work with. I studied chemistry in high school, but figuring out the relationships between the variables was the child of research and guesswork.
The absence of units of measurement is a drawback to the dataset – what are these variables measured in? Sulfur is measured in parts per million – does this apply to chlorides, sugars and acids too? What is the basis of the quality scale?
Quality was also a difficult factor to deal with as the sample is so dominated by wines of Quality 6.